Activity Recognition of Weight Lifting Exercises - Simple implimentation with R

Anik Mallick

2019-08-23

Introduction

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways.
Six young health participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E).
Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes. Participants were supervised by an experienced weight lifter to make sure the execution complied to the manner they were supposed to simulate. The exercises were performed by six male participants aged between 20-28 years, with little weight lifting experience. It was made sure that all participants could easily simulate the mistakes in a safe and controlled manner by using a relatively light dumbbell (1.25kg).
My goal here is to predict the “class” with the help of other predictors. This project is a part of Coursera Practical Machine Learning Week 4 - Peer-graded Assignment: Prediction Assignment Writeup.

Data

Load the data

Let’s load the data. I have downloaded the data already on my local system. Please download the data from here : Training and Testing. And run this code on the same directory as the data.

## [1] 19622   160
## [1]  20 160

Let’s create a validation for model tuning:

## [1] 15699   160
## [1] 3923  160

Now 3 partition of our data is ready, lets dive into analysis but 1st lets look at the proportion of different “classe”:

## 
##         A         B         C         D         E 
## 0.2843493 0.1935155 0.1744060 0.1638958 0.1838334

From the above it is clear that there are not that much bias in the data in term of different “classe”.

Column overview

The data has 160 columns and for training data 15699 rows. Data was collected with the help of 4 sensors, shown in the below diagram (diagram source).

Few Key points about the columns:

  • “X” is primary key for the data.
  • “user_name” is the id of the users. This may help us see interesting patterns for each activity for different users.
  • “classe” is the target for prediction.
  • Column - 3 to 7 is not necessary for this project. (5 features)
  • As mentioned above there are 4 different sensors used for data collection. For each sensor there are 38 different features.
  • Each sensor(“belt”,“arm”,“forearm”,“dumbbell”) has raw accelerometer, gyroscope and magnetometer readings for x, y and z axis. (4 sensor * 3 feature * 3 axis = 36 features)
  • Each sensor(“belt”,“arm”,“forearm”,“dumbbell”) has Euler angles (roll, pitch and yaw) feature.(4 sensor * 3 euler angles = 12 features)
  • For the Euler angles of each of the four sensors eight features were calculated: mean, variance, standard deviation, max, min, amplitude, kurtosis and skewness. (4 sensor * 3 feature * 8 measures = 96 features)
  • For accelerometer we also have “total” and “variance of total” feature for the 4 sensors. But for “belt”, “variance of total” is given as “var_total_accel_belt”, for the other sensors it is given as (“var_accel_arm”,“var_accel_dumbbell”,“var_accel_forearm”). So I am considering the “belt” one as a typo. (4 sensor * 2 feature = 8 features)
  • There is another thing to note here. For “belt” Euler angles feature skewness is given as “skewness_roll_belt”, “skewness_roll_belt.1” and “skewness_yaw_belt”. I am also considering “skewness_roll_belt.1” as a typo and considering it as “skewness_pitch_belt”.

Missingness in the data

Let’s take a quick look at the missingness of the data. As the no of feature is large, its better to see them by the 4 sensors:

Belt

For Belt sensor:

##            roll_belt           pitch_belt             yaw_belt 
##                    0                    0                    0 
##     total_accel_belt   kurtosis_roll_belt  kurtosis_picth_belt 
##                    0                15396                15413 
##    kurtosis_yaw_belt   skewness_roll_belt skewness_roll_belt.1 
##                15699                15395                15413 
##    skewness_yaw_belt        max_roll_belt       max_picth_belt 
##                15699                15388                15388 
##         max_yaw_belt        min_roll_belt       min_pitch_belt 
##                15396                15388                15388 
##         min_yaw_belt  amplitude_roll_belt amplitude_pitch_belt 
##                15396                15388                15388 
##   amplitude_yaw_belt var_total_accel_belt        avg_roll_belt 
##                15396                15388                15388 
##     stddev_roll_belt        var_roll_belt       avg_pitch_belt 
##                15388                15388                15388 
##    stddev_pitch_belt       var_pitch_belt         avg_yaw_belt 
##                15388                15388                15388 
##      stddev_yaw_belt         var_yaw_belt         gyros_belt_x 
##                15388                15388                    0 
##         gyros_belt_y         gyros_belt_z         accel_belt_x 
##                    0                    0                    0 
##         accel_belt_y         accel_belt_z        magnet_belt_x 
##                    0                    0                    0 
##        magnet_belt_y        magnet_belt_z 
##                    0                    0

Arm

For Arm sensor:

##            roll_arm           pitch_arm             yaw_arm 
##                   0                   0                   0 
##     total_accel_arm       var_accel_arm        avg_roll_arm 
##                   0               15388               15388 
##     stddev_roll_arm        var_roll_arm       avg_pitch_arm 
##               15388               15388               15388 
##    stddev_pitch_arm       var_pitch_arm         avg_yaw_arm 
##               15388               15388               15388 
##      stddev_yaw_arm         var_yaw_arm         gyros_arm_x 
##               15388               15388                   0 
##         gyros_arm_y         gyros_arm_z         accel_arm_x 
##                   0                   0                   0 
##         accel_arm_y         accel_arm_z        magnet_arm_x 
##                   0                   0                   0 
##        magnet_arm_y        magnet_arm_z   kurtosis_roll_arm 
##                   0                   0               15446 
##  kurtosis_picth_arm    kurtosis_yaw_arm   skewness_roll_arm 
##               15448               15398               15445 
##  skewness_pitch_arm    skewness_yaw_arm        max_roll_arm 
##               15448               15398               15388 
##       max_picth_arm         max_yaw_arm        min_roll_arm 
##               15388               15388               15388 
##       min_pitch_arm         min_yaw_arm  amplitude_roll_arm 
##               15388               15388               15388 
## amplitude_pitch_arm   amplitude_yaw_arm 
##               15388               15388

Forearm

For Forearm sensor:

##            roll_forearm           pitch_forearm             yaw_forearm 
##                       0                       0                       0 
##   kurtosis_roll_forearm  kurtosis_picth_forearm    kurtosis_yaw_forearm 
##                   15448                   15449                   15699 
##   skewness_roll_forearm  skewness_pitch_forearm    skewness_yaw_forearm 
##                   15447                   15449                   15699 
##        max_roll_forearm       max_picth_forearm         max_yaw_forearm 
##                   15388                   15388                   15448 
##        min_roll_forearm       min_pitch_forearm         min_yaw_forearm 
##                   15388                   15388                   15448 
##  amplitude_roll_forearm amplitude_pitch_forearm   amplitude_yaw_forearm 
##                   15388                   15388                   15448 
##     total_accel_forearm       var_accel_forearm        avg_roll_forearm 
##                       0                   15388                   15388 
##     stddev_roll_forearm        var_roll_forearm       avg_pitch_forearm 
##                   15388                   15388                   15388 
##    stddev_pitch_forearm       var_pitch_forearm         avg_yaw_forearm 
##                   15388                   15388                   15388 
##      stddev_yaw_forearm         var_yaw_forearm         gyros_forearm_x 
##                   15388                   15388                       0 
##         gyros_forearm_y         gyros_forearm_z         accel_forearm_x 
##                       0                       0                       0 
##         accel_forearm_y         accel_forearm_z        magnet_forearm_x 
##                       0                       0                       0 
##        magnet_forearm_y        magnet_forearm_z 
##                       0                       0

Dumbbell

For Dumbbell sensor:

##            roll_dumbbell           pitch_dumbbell             yaw_dumbbell 
##                        0                        0                        0 
##   kurtosis_roll_dumbbell  kurtosis_picth_dumbbell    kurtosis_yaw_dumbbell 
##                    15392                    15390                    15699 
##   skewness_roll_dumbbell  skewness_pitch_dumbbell    skewness_yaw_dumbbell 
##                    15391                    15389                    15699 
##        max_roll_dumbbell       max_picth_dumbbell         max_yaw_dumbbell 
##                    15388                    15388                    15392 
##        min_roll_dumbbell       min_pitch_dumbbell         min_yaw_dumbbell 
##                    15388                    15388                    15392 
##  amplitude_roll_dumbbell amplitude_pitch_dumbbell   amplitude_yaw_dumbbell 
##                    15388                    15388                    15392 
##     total_accel_dumbbell       var_accel_dumbbell        avg_roll_dumbbell 
##                        0                    15388                    15388 
##     stddev_roll_dumbbell        var_roll_dumbbell       avg_pitch_dumbbell 
##                    15388                    15388                    15388 
##    stddev_pitch_dumbbell       var_pitch_dumbbell         avg_yaw_dumbbell 
##                    15388                    15388                    15388 
##      stddev_yaw_dumbbell         var_yaw_dumbbell         gyros_dumbbell_x 
##                    15388                    15388                        0 
##         gyros_dumbbell_y         gyros_dumbbell_z         accel_dumbbell_x 
##                        0                        0                        0 
##         accel_dumbbell_y         accel_dumbbell_z        magnet_dumbbell_x 
##                        0                        0                        0 
##        magnet_dumbbell_y        magnet_dumbbell_z 
##                        0                        0

So it is very interesting to see that few of the features are over 90% missing, I would drop those columns for further analysis. But the interesting thing is that all of those columns have same no of NA values.

## [1] 100

So we can drop 100 column as they are mostly missing. After we drop these column there will be 52 predictors left.

Analysis

Now lets get into analysis, first let’s look at the correlation among the predictors.

## [1] 15699    53

Correlation among predictors

Here I have subsetted the data to show only the columns for which absolute correlation is higher than 0.8 with at least one other column. From Correlation plot it is clear that there is lot of columns that are highly correlated. That might be an issue when we will be in modeling phase. Either we can drop those columns or we can perform PCA(Principal Components Analysis). One important thing to note from this graph is that high correlation is only seen between the same sensor i.e. “belt”,“arm”,“forearm” and “dumbbell”.

Correlation with the target

As the target is a categorical variable, we cannot check correlation with the other variables directly. But we can use correlationfunnel::correlate to see the correlation with each level of“classe” and other features. Lets go by them one by one.

classe__A

For *classe__A* it seems that the “Arm and Forearm” sensors are more important.

  • “accel_arm_x” is correlated with “magnet_arm_x”, so wont consider.
  • “gyros_arm_y” is correlated with “gyros_arm_x”, so wont consider.
  • So top 5 significant features for "classe__A" are - (magnet_arm_x, pitch_forearm , magnet_dumbbell_y, roll_forearm, gyros_dumbbell_y)

classe__B

For *classe__B* it seems that the “Dumbbell and Belt” sensors are more important.

  • So top 5 significant features for "classe__A" are - (magnet_dumbbell_y, magnet_dumbbell_x , roll_dumbbell , magnet_belt_y , accel_dumbbell_x )

classe__C

For *classe__C* it seems that the “Dumbbell” sensors are more important.

  • So top 5 significant features for "classe__A" are - (magnet_dumbbell_y, roll_dumbbell , accel_dumbbell_y , magnet_dumbbell_x, magnet_dumbbell_z)

classe__D

For *classe__D* it seems that the “Forearm, Arm and Dumbbell” sensors are more important.

  • So top 5 significant features for "classe__A" are - (pitch_forearm , magnet_arm_y , magnet_forearm_x, accel_dumbbell_y, accel_forearm_x)

classe__E

For *classe__E* it seems that the “Belt” sensors are more important.

  • “total_accel_belt” is correlated with “roll_belt”, so wont consider.
  • “yaw_belt” is correlated with “roll_belt”, so wont consider.
  • “accel_belt_z” is correlated with “roll_belt”, so wont consider.
  • So top 5 significant features for "classe__A" are - (magnet_belt_y , magnet_belt_z , roll_belt, gyros_belt_z , magnet_dumbbell_y)

Let’s make some plots

This document is already too long coursera assignment, so for this section I’ll work on top 5 features for each class selected in the last section. So lets select only those columns.

##   arm forearm belt dumbbell
## 1   2       4    4        7

One interesting thing to note here is that the dumbbell sensor turned out to be the most important sensor among the 4. I would like to explore that in future works.

Its model time

In the above section we have narrowed down to 17 predictors and also we have decided to use 3 preprocessing steps. In this section we will build on the analyzed data to create models for prediction. We will use Classification tree, Random Forest, Generalized Linear regression and SVM , them we will stack it with a Random forest to get the final model.

Let’s look the results:

Classification Tree

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1003  330  319  294  106
##          B   19  256   20  109  103
##          C   93  173  345  240  212
##          D    0    0    0    0    0
##          E    1    0    0    0  300
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4853          
##                  95% CI : (0.4696, 0.5011)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3271          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8987  0.33729  0.50439   0.0000  0.41609
## Specificity            0.6263  0.92067  0.77833   1.0000  0.99969
## Pos Pred Value         0.4888  0.50493  0.32455      NaN  0.99668
## Neg Pred Value         0.9396  0.85275  0.88147   0.8361  0.88377
## Prevalence             0.2845  0.19347  0.17436   0.1639  0.18379
## Detection Rate         0.2557  0.06526  0.08794   0.0000  0.07647
## Detection Prevalence   0.5231  0.12924  0.27097   0.0000  0.07673
## Balanced Accuracy      0.7625  0.62898  0.64136   0.5000  0.70789

Clearly Classification tree is not performing well, accuracy is very low. One thing to note here is that True classe_A are detected with high accuracy, but other classe are incorrectly predicted as classe_A.

Random Forest

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1112    7    0    0    0
##          B    3  741    5    3    1
##          C    1    7  676   15    4
##          D    0    4    3  625    1
##          E    0    0    0    0  715
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9862          
##                  95% CI : (0.9821, 0.9896)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9826          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9964   0.9763   0.9883   0.9720   0.9917
## Specificity            0.9975   0.9962   0.9917   0.9976   1.0000
## Pos Pred Value         0.9937   0.9841   0.9616   0.9874   1.0000
## Neg Pred Value         0.9986   0.9943   0.9975   0.9945   0.9981
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2835   0.1889   0.1723   0.1593   0.1823
## Detection Prevalence   0.2852   0.1919   0.1792   0.1614   0.1823
## Balanced Accuracy      0.9970   0.9862   0.9900   0.9848   0.9958

Random Forest took the lead with 98%+ accuracy.

GBM

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1083   40    2    5    3
##          B   18  642   32   16   11
##          C    8   54  635   39   10
##          D    4   21   14  582    9
##          E    3    2    1    1  688
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9253          
##                  95% CI : (0.9166, 0.9333)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9055          
##                                           
##  Mcnemar's Test P-Value : 2.509e-07       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9704   0.8458   0.9284   0.9051   0.9542
## Specificity            0.9822   0.9757   0.9657   0.9854   0.9978
## Pos Pred Value         0.9559   0.8929   0.8512   0.9238   0.9899
## Neg Pred Value         0.9882   0.9635   0.9846   0.9815   0.9898
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2761   0.1637   0.1619   0.1484   0.1754
## Detection Prevalence   0.2888   0.1833   0.1902   0.1606   0.1772
## Balanced Accuracy      0.9763   0.9108   0.9470   0.9452   0.9760

Clearly GBM is also doing good but RF is still the best.

SVM

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1096   40   18   17    2
##          B    1  676   15    5    6
##          C    9   40  640   45    3
##          D   10    3    9  575    9
##          E    0    0    2    1  701
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9401          
##                  95% CI : (0.9322, 0.9473)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9241          
##                                           
##  Mcnemar's Test P-Value : 1.808e-15       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9821   0.8906   0.9357   0.8942   0.9723
## Specificity            0.9726   0.9915   0.9701   0.9905   0.9991
## Pos Pred Value         0.9344   0.9616   0.8684   0.9488   0.9957
## Neg Pred Value         0.9927   0.9742   0.9862   0.9795   0.9938
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2794   0.1723   0.1631   0.1466   0.1787
## Detection Prevalence   0.2990   0.1792   0.1879   0.1545   0.1795
## Balanced Accuracy      0.9773   0.9411   0.9529   0.9424   0.9857

So it clear that SVM is giving good accuracy, though it not as good as random forest.

Results

So from the above analysis it is clear that Random Forest is taking the lead in term of prediction. Now lets see how good it does in the Test set given in the Coursera project.

##    problem_id PREDICTION_RF PREDICTION_GBM PREDICTION_SVM
## 1           1             E              E              C
## 2           2             A              E              A
## 3           3             A              D              B
## 4           4             E              E              C
## 5           5             E              E              A
## 6           6             E              D              C
## 7           7             E              E              B
## 8           8             B              D              A
## 9           9             A              B              E
## 10         10             E              E              E
## 11         11             A              E              C
## 12         12             A              D              C
## 13         13             E              B              E
## 14         14             A              D              B
## 15         15             E              E              B
## 16         16             E              E              A
## 17         17             E              E              C
## 18         18             B              E              A
## 19         19             E              E              A
## 20         20             E              E              E

This is odd- UPDATE

My validation accuracy is good but my score in the quize was not correct. Maybe I should have not droped the columns when I selected 17 features. As Random Forest is doing so go I’ll train one more RF with all the data and see if it helps. But still I dont understand why even after my accuracy on validation is above 95, how it is so bad. Need to look into it.

So for this update I will use all the predictors. And as there are so many columns I’ll make it parallel so it doesnot take that long.

## Loading required package: foreach
## Loading required package: iterators
## Loading required package: parallel
## [1] 3
##    problem_id PREDICTION_RF PREDICTION_GBM PREDICTION_SVM
## 1           1             E              E              C
## 2           2             A              E              A
## 3           3             A              D              B
## 4           4             E              E              C
## 5           5             E              E              A
## 6           6             E              D              C
## 7           7             E              E              B
## 8           8             B              D              A
## 9           9             A              B              E
## 10         10             E              E              E
## 11         11             A              E              C
## 12         12             A              D              C
## 13         13             E              B              E
## 14         14             A              D              B
## 15         15             E              E              B
## 16         16             E              E              A
## 17         17             E              E              C
## 18         18             B              E              A
## 19         19             E              E              A
## 20         20             E              E              E
##    PREDICTION_RF2_ALL_COL
## 1                       B
## 2                       A
## 3                       B
## 4                       A
## 5                       A
## 6                       E
## 7                       D
## 8                       B
## 9                       A
## 10                      A
## 11                      B
## 12                      C
## 13                      B
## 14                      A
## 15                      E
## 16                      E
## 17                      A
## 18                      B
## 19                      B
## 20                      B

This time I did better. May be because the test size is so small my previous models did so bad because my validation accuracy was ok.